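Deep convolutional neural networks (CNNs) have recently reached state-of-the-art performance in handwritten text recognition (HTR). However, recent studies have shown that the learning performance of typical CNNs is limited because they are homogeneous networks with a simple (linear) neuron model. Operational Neural Networks (ONNs), whose heterogeneous network structure incorporates non-linear neurons, have recently been proposed to address this drawback. Self-ONNs are self-organized variants of ONNs with a generative neuron model that can generate any non-linear function using a Taylor approximation. In this study, to improve the state-of-the-art performance level in HTR, 2D self-organized ONNs (Self-ONNs) are proposed as the core of a novel network model. In addition, this study uses deformable convolutions, which have recently been shown to handle variations in writing style better. Results on the IAM English dataset and the Hadara80p Arabic dataset show that the proposed model with Self-ONN operational layers significantly improves the character error rate (CER) and word error rate (WER). Compared with their CNN counterparts, Self-ONNs reduce CER and WER by 1.2% and 3.4% on Hadara80p, and by 0.199% and 1.244% on the IAM dataset. The results on the IAM benchmark show that the proposed model with Self-ONN operational layers outperforms recent deep CNN models by a significant margin, while the use of Self-ONNs with deformable convolutions yields exceptional results.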
translated by 谷歌翻译
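The CER and WER figures quoted above are conventionally computed as an edit distance normalized by the reference length. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function names `edit_distance`, `cer`, and `wer` are hypothetical, and tokenizing words by whitespace is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits / reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits / reference words (whitespace split is an assumption)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

For example, `cer("kitten", "sitting")` needs three character edits against six reference characters, so it evaluates to 0.5.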
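Classical image denoising methods use the non-local self-similarity principle to effectively recover image content from noisy images. Current state-of-the-art methods use deep convolutional neural networks (CNNs) to learn the mapping from noisy to clean images. Deep denoising CNNs exhibit a high learning capacity and integrate non-local information thanks to the large receptive field produced by their many hidden layers. However, deep networks are also computationally complex and require large amounts of training data. To address these issues, this study focuses on self-organized operational neural networks (Self-ONNs) empowered by a novel neuron model that can achieve similar or better denoising performance with a compact and shallow model. Recently, the concept of super neurons has been introduced, which augments the non-linear transformations of generative neurons by using non-localized kernel locations to obtain an enlarged receptive field. This is the key advance that removes the need for a deep network configuration. Since the integration of non-local information is known to benefit denoising, in this work we investigate the use of super neurons for both synthetic and real-world image denoising. We also discuss the practical issues of implementing the super neuron model on GPUs and propose a trade-off between the heterogeneity of the non-localized operations and computational complexity. Our results show that, at the same width and depth, Self-ONNs with super neurons provide a significant boost in denoising performance over networks with generative and convolutional neurons for both denoising tasks. Moreover, the results show that Self-ONNs with super neurons achieve competitive performance on synthetic denoising and superior performance on real-world denoising compared with well-known deep CNN denoisers.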
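As a rough illustration of the neuron models mentioned above, the sketch below implements a 1D generative neuron as a truncated power series: the output is a bias plus the sum over q = 1..Q of the input raised to the power q, filtered by its own kernel. The optional integer `shift` is only a crude, hypothetical stand-in for a non-localized kernel location; the circular shift via `np.roll`, the 1D setting, and the function name are all assumptions and do not reproduce the authors' super neuron implementation.

```python
import numpy as np


def generative_neuron_1d(x, kernels, bias=0.0, shift=0):
    """Truncated power-series neuron: bias + sum_q conv(x**q, kernels[q-1]).

    x       : 1D input signal
    kernels : list of Q 1D kernels, one per power q = 1..Q
    shift   : integer offset applied to the input before filtering
              (illustrative assumption; circular, unlike a padded shift)
    """
    x = np.asarray(x, dtype=float)
    if shift:
        x = np.roll(x, shift)
    y = np.full(x.shape, bias, dtype=float)
    for q, w in enumerate(kernels, start=1):
        # np.convolve flips the kernel; for learned kernels this is only a convention.
        y += np.convolve(x ** q, np.asarray(w, dtype=float), mode="same")
    return y
```

With a single kernel (Q = 1) this reduces to an ordinary linear convolution, which is the sense in which the convolutional neuron is a special case of the generative one.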
Are extralinguistic signals such as image pixels crucial for inducing constituency grammars? While past work has shown substantial gains from multimodal cues, we investigate whether such gains persist in the presence of rich information from large language models (LLMs). We find that our approach, LLM-based C-PCFG (LC-PCFG), outperforms previous multimodal methods on the task of unsupervised constituency parsing, achieving state-of-the-art performance on a variety of datasets. Moreover, LC-PCFG results in an over 50% reduction in parameter count, and speedups in training time of 1.7x for image-aided models and more than 5x for video-aided models. These results challenge the notion that extralinguistic signals such as image pixels are needed for unsupervised grammar induction, and point to the need for better text-only baselines when evaluating whether multimodality is needed for the task.
We present Masked Audio-Video Learners (MAViL) to train audio-visual representations. Our approach learns with three complementary forms of self-supervision: (1) reconstruction of masked audio and video input data, (2) intra- and inter-modal contrastive learning with masking, and (3) self-training by reconstructing joint audio-video contextualized features learned from the first two objectives. Pre-training with MAViL not only enables the model to perform well in audio-visual classification and retrieval tasks but also improves representations of each modality in isolation, without using information from the other modality for fine-tuning or inference. Empirically, MAViL sets a new state-of-the-art on AudioSet (53.1 mAP) and VGGSound (67.1% accuracy). For the first time, a self-supervised audio-visual model outperforms ones that use external supervision on these benchmarks. Code will be available soon.
Semantic navigation is necessary to deploy mobile robots in uncontrolled environments like our homes, schools, and hospitals. Many learning-based approaches have been proposed in response to the lack of semantic understanding of the classical pipeline for spatial navigation, which builds a geometric map using depth sensors and plans to reach point goals. Broadly, end-to-end learning approaches reactively map sensor inputs to actions with deep neural networks, while modular learning approaches enrich the classical pipeline with learning-based semantic sensing and exploration. But learned visual navigation policies have predominantly been evaluated in simulation. How well do different classes of methods work on a robot? We present a large-scale empirical study of semantic visual navigation methods comparing representative methods from classical, modular, and end-to-end learning approaches across six homes with no prior experience, maps, or instrumentation. We find that modular learning works well in the real world, attaining a 90% success rate. In contrast, end-to-end learning does not, dropping from 77% simulation to 23% real-world success rate due to a large image domain gap between simulation and reality. For practitioners, we show that modular learning is a reliable approach to navigate to objects: modularity and abstraction in policy design enable Sim-to-Real transfer. For researchers, we identify two key issues that prevent today's simulators from being reliable evaluation benchmarks - (A) a large Sim-to-Real gap in images and (B) a disconnect between simulation and real-world error modes - and propose concrete steps forward.
We consider the problem of embodied visual navigation given an image-goal (ImageNav), where an agent is initialized in an unfamiliar environment and tasked with navigating to a location 'described' by an image. Unlike related navigation tasks, ImageNav does not have a standardized task definition, which makes comparison across methods difficult. Further, existing formulations have two problematic properties: (1) image-goals are sampled from random locations, which can lead to ambiguity (e.g., looking at walls), and (2) image-goals match the camera specification and embodiment of the agent; this rigidity is limiting when considering user-driven downstream applications. We present the Instance-specific ImageNav task (InstanceImageNav) to address these limitations. Specifically, the goal image is 'focused' on some particular object instance in the scene and is taken with camera parameters independent of the agent. We instantiate InstanceImageNav in the Habitat Simulator using scenes from the Habitat-Matterport3D dataset (HM3D) and release a standardized benchmark to measure community progress.
The National Health and Nutritional Status Survey (NHANSS) is conducted annually by the Ministry of Health in Negara Brunei Darussalam to assess the population's health and nutritional patterns and characteristics. The main aim of this study was to discover meaningful patterns (groups) in the obese sample of the NHANSS data by applying data reduction and interpretation techniques. The mixed nature of the variables (qualitative and quantitative) in the data set added novelty to the study. Accordingly, the Categorical Principal Component Analysis (CATPCA) technique was chosen to interpret the results. The relationships between obesity and lifestyle factors such as demography, socioeconomic status, physical activity, dietary behavior, history of blood pressure, and diabetes were determined based on the principal components generated by CATPCA. The results were validated with the split method to cross-verify the authenticity of the generated groups. Based on the analysis and results, two subgroups were found in the data set, and the salient features of these subgroups are reported. These results can be proposed for the betterment of the healthcare industry.
In this work, we show how to learn a visual walking policy that only uses a monocular RGB camera and proprioception. Since simulating RGB is hard, we necessarily have to learn vision in the real world. We start with a blind walking policy trained in simulation. This policy can traverse some terrains in the real world but often struggles since it lacks knowledge of the upcoming geometry. This can be resolved with the use of vision. We train a visual module in the real world to predict the upcoming terrain with our proposed algorithm Cross-Modal Supervision (CMS). CMS uses time-shifted proprioception to supervise vision and allows the policy to continually improve with more real-world experience. We evaluate our vision-based walking policy over a diverse set of terrains including stairs (up to 19cm high), slippery slopes (inclination of 35 degrees), curbs and tall steps (up to 20cm), and complex discrete terrains. We achieve this performance with less than 30 minutes of real-world data. Finally, we show that our policy can adapt to shifts in the visual field with a limited amount of real-world experience. Video results and code at https://antonilo.github.io/vision_locomotion/.
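The idea of supervising vision with time-shifted proprioception can be pictured as pairing each camera frame with a proprioceptive terrain estimate recorded a few steps later, once the robot has reached the terrain the frame was looking at. The sketch below only illustrates that pairing under stated assumptions: the function name, the fixed integer `shift`, and the equal-length synchronized logs are hypothetical and are not taken from the CMS implementation.

```python
def time_shifted_pairs(frames, proprio_labels, shift):
    """Pair frame t with the proprioceptive label recorded at step t + shift.

    frames         : sequence of camera frames, one per control step
    proprio_labels : sequence of proprioception-derived terrain estimates,
                     aligned step by step with `frames` (assumption)
    shift          : non-negative number of steps the label lags the frame
    """
    if shift < 0:
        raise ValueError("shift must be non-negative")
    if len(frames) != len(proprio_labels):
        raise ValueError("frames and proprio_labels must have the same length")
    # The last `shift` frames have no future label yet and are dropped.
    return [(frames[t], proprio_labels[t + shift])
            for t in range(len(frames) - shift)]
```

For instance, with four logged steps and `shift=2`, only the first two frames receive labels, taken from steps 2 and 3.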
Unmanned aerial vehicles (UAVs) with on-board cameras are widely used for remote surveillance and video capturing applications. In remote virtual reality (VR) applications, multiple UAVs can be used to capture different partially overlapping angles of the ground target, which can be stitched together to provide 360° views. This requires coordinated formation of UAVs that is adaptive to movements of the ground target. In this paper, we propose a joint UAV formation and tracking framework to capture 360° angles of the target. The proposed framework uses a zero-touch approach for automated and adaptive reconfiguration of multiple UAVs in a coordinated manner without the need for human intervention. This is suited to both military and civilian applications. Simulation results demonstrate the convergence and configuration of the UAVs with arbitrary initial locations and orientations. The performance has been tested for various numbers of UAVs and different mobility patterns of the ground target.
Adaptive optimization methods are well known to achieve superior convergence relative to vanilla gradient methods. The traditional viewpoint in optimization, particularly in convex optimization, explains this improved performance by arguing that, unlike vanilla gradient schemes, adaptive algorithms mimic the behavior of a second-order method by adapting to the global geometry of the loss function. We argue that in the context of neural network optimization, this traditional viewpoint is insufficient. Instead, we advocate for a local trajectory analysis. For iterate trajectories produced by running a generic optimization algorithm OPT, we introduce $R^{\text{OPT}}_{\text{med}}$, a statistic that is analogous to the condition number of the loss Hessian evaluated at the iterates. Through extensive experiments, we show that adaptive methods such as Adam bias the trajectories towards regions where $R^{\text{Adam}}_{\text{med}}$ is small, where one might expect faster convergence. By contrast, vanilla gradient methods like SGD bias the trajectories towards regions where $R^{\text{SGD}}_{\text{med}}$ is comparatively large. We complement these empirical observations with a theoretical result that provably demonstrates this phenomenon in the simplified setting of a two-layer linear network. We view our findings as evidence of the need for a new explanation of the success of adaptive methods, one that differs from the conventional wisdom.
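The abstract does not define $R^{\text{OPT}}_{\text{med}}$, so the sketch below computes only the quantity it is said to be analogous to: the ordinary condition number of the loss Hessian at a given iterate. The finite-difference Hessian, the function names, and the toy quadratic loss are all assumptions for illustration; this is not the paper's statistic.

```python
import numpy as np


def numerical_hessian(f, w, h=1e-3):
    """Central finite-difference Hessian of a scalar function f at point w."""
    w = np.asarray(w, dtype=float)
    n = w.size
    hess = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n)
            ej = np.zeros(n)
            ei[i] = h
            ej[j] = h
            hess[i, j] = (f(w + ei + ej) - f(w + ei - ej)
                          - f(w - ei + ej) + f(w - ei - ej)) / (4.0 * h * h)
    return 0.5 * (hess + hess.T)  # symmetrize away rounding noise


def hessian_condition_number(f, w, h=1e-3):
    """Ratio of largest to smallest absolute Hessian eigenvalue at w.

    Becomes inf (with a NumPy warning) if the Hessian is singular at w.
    """
    eigvals = np.abs(np.linalg.eigvalsh(numerical_hessian(f, w, h)))
    return eigvals.max() / eigvals.min()


def quadratic_loss(w):
    """Toy loss (hypothetical) with Hessian diag(1, 10)."""
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)
```

Evaluating `hessian_condition_number` at each iterate of an optimizer run would give a trajectory-level picture in the spirit of the local analysis described above, though large networks would need Hessian-vector products rather than a dense Hessian.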